Please download the data in data.zip. Then, move data.zip to your Desktop and unzip it (usually double-clicking it will work).
All workshop material and data are available at https://github.com/BIGslu/workshops/tree/main/2022.08.15_R.tidyverse.workshop.
When you open RStudio, it should look like so with multiple panels. If you see only 1 panel, then you’re likely in R, not RStudio.
Install R packages by running the following script in your R console (left panel in the above image).
#CRAN packages
install.packages("tidyverse")
#Bioconductor packages
install.packages("BiocManager")
BiocManager::install("limma")
To make sure packages are correctly installed, load them into R with
library( ). If you see any ERROR, please come 15
minutes early to the workshop the day of or contact Kim for
assistance.
First, the package(s) that give messages upon loading.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
Then, check package(s) that load silently with no messages.
library(limma)
For reproducibility, here is the complete list of software used in this workshop.
sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur ... 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] limma_3.52.2 forcats_0.5.1 stringr_1.4.0 dplyr_1.0.9
## [5] purrr_0.3.4 readr_2.1.2 tidyr_1.2.0 tibble_3.1.7
## [9] ggplot2_3.3.6 tidyverse_1.3.1
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.1.2 xfun_0.31 bslib_0.3.1 haven_2.5.0
## [5] colorspace_2.0-3 vctrs_0.4.1 generics_0.1.3 htmltools_0.5.2
## [9] yaml_2.3.5 utf8_1.2.2 rlang_1.0.4 jquerylib_0.1.4
## [13] pillar_1.7.0 withr_2.5.0 glue_1.6.2 DBI_1.1.3
## [17] dbplyr_2.2.1 modelr_0.1.8 readxl_1.4.0 lifecycle_1.0.1
## [21] cellranger_1.1.0 munsell_0.5.0 gtable_0.3.0 rvest_1.0.2
## [25] evaluate_0.15 knitr_1.39 tzdb_0.3.0 fastmap_1.1.0
## [29] fansi_1.0.3 broom_1.0.0 backports_1.4.1 scales_1.2.0
## [33] jsonlite_1.8.0 fs_1.5.2 hms_1.1.1 digest_0.6.29
## [37] stringi_1.7.8 rprojroot_2.0.3 grid_4.2.1 cli_3.3.0
## [41] tools_4.2.1 magrittr_2.0.3 sass_0.4.1 crayon_1.5.1
## [45] pkgconfig_2.0.3 ellipsis_0.3.2 xml2_1.3.3 reprex_2.0.1
## [49] lubridate_1.8.0 assertthat_0.2.1 rmarkdown_2.14 httr_1.4.3
## [53] rstudioapi_0.13 R6_2.5.1 compiler_4.2.1
In this workshop, we introduce you to R and RStudio at the beginner level. In it, we cover:
We will do all of our work in RStudio. RStudio is an integrated development and analysis environment for R that brings a number of conveniences over using R in a terminal or other editing environments.
During the workshop, we will build an R script together, which will be posted as ‘live_notes’ after the workshop at https://github.com/BIGslu/workshops/tree/main/2022.08.15_R.tidyverse.workshop/live_notes
Please install R and RStudio. See the setup instructions for more details.
When you start RStudio, you will see something like the following window appear:
Notice that the window is divided into three “panes”:
Console (the entire left side): this is your view into the R engine. You can type in R commands here and see the output printed by R. (To make it easier to tell them apart, your input is printed in blue, while the output is black.) There are several editing conveniences available: use up and down arrow keys to go back to previously entered commands, which can then be edited and re-run; TAB for completing the name before the cursor; see more in online docs.
Environment/History (tabbed in upper right): view current user-defined objects and previously-entered commands, respectively.
Files/Plots/Packages/Help (tabbed in lower right): as their names suggest, these are used to view the contents of the current directory, graphics created by the user, install packages, and view the built-in help pages.
To change the look of RStudio, you can go to Tools -> Global Options -> Appearance and select colors, font size, etc. If you plan to be working for longer periods, we suggest choosing a dark background color scheme to save your computer battery and your eyes.
Projects are a great feature of RStudio. When you create a project,
RStudio creates an .Rproj file that links all of your files
and outputs to the project directory. When you import data, R
automatically looks for the file in the project directory instead of you
having to specify a full file path on your computer like
/Users/username/Desktop/. R also automatically saves any
output to the project directory. Finally, projects allow you to save
your R environment in .RData so that when you close RStudio
and then re-open it, you can start right where you left off without
re-importing any data or re-calculating any intermediate steps.
RStudio has a simple interface to create and switch between projects, accessed from the button in the top-right corner of the RStudio window. (Labeled “Project: (None)”, initially.)
Let’s create a project to work in for this workshop. Start by clicking the “Project” button in the upper right or going to the “File” menu. Select “New Project” and the following will appear.
You can either create a project in an existing directory or make a new directory on your computer - just be sure you know where it is.
After your project is created, navigate to its directory using your
Finder/File explorer. You will see the .RProj file has been
created.
To access this project in the future, simply double-click the
RProj and RStudio will open the project or choose File >
Open Project from within an already open RStudio window.
R script files are the primary way in which R facilitates reproducible research. They contain the code that loads your raw data, cleans it, performs the analyses, and creates and saves visualizations. R scripts maintain a record of everything that is done to the raw data to reach the final result. That way, it is very easy to write up and communicate your methods because you have a document listing the precise steps you used to conduct your analyses. This is one of R’s primary advantages compared to traditional tools like Excel, where it may be unclear how to reproduce the results.
Generally, if you are testing an operation (e.g. what would my data look like if I applied a log-transformation to it?), you should do it in the console (left pane of RStudio). If you are committing a step to your analysis (e.g. I want to apply a log-transformation to my data and then conduct the rest of my analyses on the log-transformed data), you should add it to your R script so that it is saved for future use.
Additionally, you should annotate your R scripts with comments. In
each line of code, any text preceded by the # symbol will
not execute. Comments can be useful to remind yourself and to tell other
readers what a specific chunk of code does.
Let’s create an R script (File > New File > R Script) and save
it as introR_live_notes.R in your main project directory.
If you again look to the project directory on your computer, you will
see introR_live_notes.R is now saved there.
We will work together to create and populate the
introR_live_notes.R script throughout this workshop.
R packages are units of shareable code, containing functions that
facilitate and enhance analyses. Let’s install tidyverse,
which is actually a meta-package containing several packages useful in
data manipulation and plotting. Packages are typically installed from CRAN (The Comprehensive R Archive
Network), which is a database containing R itself as well as many R
packages. Any package can be installed from CRAN using the
install.packages function. You can input this into your
console (as opposed to live_notes.R) since once a package
is installed on your computer, you won’t need to re-install it
again.
install.packages("tidyverse", Ncpus=2)
This can take several minutes.
After installing a package, and every time you open a new
RStudio session, the packages you want to use need to be loaded into the
R workspace with the library function. This tells R to
access the package’s functions and prevents RStudio from lags that would
occur if it automatically loaded every downloaded package every time you
opened it.
# Data manipulation and visualization
library(tidyverse)
Bioconductor is another repository of R packages. It has different requirements for upload and houses many of the biology-relevant packages. To install from Bioconductor, you first install its installer from CRAN.
install.packages("BiocManager")
Then install your package of choice using it’s installer. Here, we
install limma, a package for analysis of microarray and
RNA-seq data.
If prompted, say a to “Update all/some/none?
[a/s/n]” and no to “Do you want to install from sources the
packages which need compilation? (Yes/no/cancel)”
BiocManager::install("limma")
Before doing anything in R, it is a good idea to set your random seed. Your analyses may not end up using a seed but by setting it, you ensure that everything is exactly reproducible.
set.seed(4389)
Create a directory inside your project directory called
data and move the data files into this directory.
.csv, .tsv, etc.One of R’s most essential data structures is the data frame, which is
simply a table of m columns by n rows. First,
we will read in the RNA-seq metadata into RStudio using the base R
read.table function.
Each R function follows the following basic syntax, where
Function is the name of the function.
Function(argument1=..., argument2=..., ...)
read.table has many arguments; however, we only need to
specify 3 arguments to correctly read in our data as a data frame. For
our data, we will need to specify:
file - gives the path to the file that we want to load
from our working directory (current project directory).sep - tells R that our data are comma-separatedheader - tells R that the first row in our data
contains the names of the variables (columns).We will store the data as an object named meta
using the assignment operator <-, so that we can re-use
it in our analysis.
# read the data and save it as an object
meta <- read.table(file="data/metadata.csv",
sep=",", header=TRUE)
Now whenever we want to use these data, we simply call
meta
You can read up about the different arguments of a specific function
by typing ?Function or help(Function) in your
R console.
?read.table
You will notice that there are multiple functions of the
read.table help page. This include similar and related
functions with additional options. For example, since our data are in
.csv format, we could’ve instead read them into R with
read.csv which assumes the options
sep=",", header=TRUE by default.
# read the data with different function
meta <- read.csv(file="data/metadata.csv")
.RDataYou may have data that do not fit nicely into a single table or into
a table at all (like plots). You can save these as .RData,
which can be loaded directly into R. You can also save multiple tables
and/or other objects in a single .RData file to make
loading your data quick and easy. Moreover, .RData are
automatically compressed so they take up less storage space than
multiple tables.
load("data/dat_voom.RData")
Notice that these data appear already named in your R environment as
dat. Object names are determined when saving so be sure to
create short but descriptive names before saving to
.RData.
See the objects data type with class.
class(dat)
## [1] "EList"
## attr(,"package")
## [1] "limma"
Let’s return to the simpler metadata for now. This data frame
consists of 20 rows (observations) and 9 columns (variables). You can
see this quickly using the dimension function dim
dim(meta)
## [1] 20 9
Each column and each row of a data frame are individual R vectors. R
vectors are one-dimensional arrays of data. For example, we can extract
column vectors from data frames using the $ operator.
# Extract patient IDs
meta$FULLIDNO
## NULL
R objects have several different classes (types). Our data frame
contains 2 R data types. The base R class function will
tell you what data type an object is.
class(meta)
## [1] "data.frame"
class(meta$libID)
## [1] "character"
class(meta$lib.size)
## [1] "NULL"
We see that our libID column is character,
meaning it is non-numeric. On the other hand, lib.size is
numeric, meaning a number.
Common data types not found in these data include the following. We will see these later on.
factor: non-numeric value with a set number of unique
levelsinteger: whole number numericlogical: TRUE/FALSE designationS3, S4)Now, let’s look at the limma EList data. These data are
in S3 format meaning they have 3 dimensions. In essence,
they are a list of multiple data frames and vectors. If you click on the
dat object in your Environment tab, you will see multiple
pieces.
In these data, pieces are data frames or matrices. You can again see
this with class only this time you specify a part of the
dat object with $
class(dat$genes)
## [1] "data.frame"
class(dat$E)
## [1] "matrix" "array"
class(dat$targets)
## [1] "data.frame"
class(dat$weights)
## [1] "matrix" "array"
class(dat$design)
## [1] "matrix" "array"
Notice that you get the message
Loading required package: limma. If you did not have
limma installed, you could not work with these data because
they are a data type specific to limma. Hence, why we
installed this package previously.
Or going deeper, you can specify one column in the genes
data frame
class(dat$genes$hgnc_symbol)
## [1] "character"
Of note, working with S4 objects is very similar to
S3 except that they are accessed with @
instead of $. However, we will not use S4 in
this workshop.
A large proportion of R functions operate on vectors to perform quick computations over their values. Here are some examples:
# Compute the variance of total number of sequences
var(meta$total_seq)
## [1] 7.843553e+12
# Find whether any samples have greater than 10 million sequences
meta$total_seq > 10E6
## [1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE TRUE FALSE FALSE
## [13] TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE
# Find the unique values of metadata's condition
unique(meta$condition)
## [1] "Media" "Mtb"
Functions executed on an object in R may respond exclusively to one or more data types or may respond differently depending on the data type. When you use the incorrect data type, you will get an error or warning message. For example, you cannot take the mean of a factor or character.
# Compute the mean of libID
mean(meta$libID)
## Warning in mean.default(meta$libID): argument is not numeric or logical:
## returning NA
## [1] NA
Since vectors are 1D arrays of a defined length, their individual
values can be retrieved using vector indices. R uses 1-based indexing,
meaning the first value in an R vector corresponds to the index 1.
(Importantly, if you use python, that language is 0-based, meaning the
first value is index 0.) Each subsequent element increases the index by
1. For example, we can extract the value of the 5th element of the
libID vector using the square bracket operator
[ ] like so.
meta$libID[5]
## [1] "pt03_Media"
In contrast, data frames are 2D arrays so indexing is done across
both dimensions as [rows, columns]. So, we can extract the
same oxygen value directly from the data frame knowing it is in the 5th
row and 1st column.
meta[5, 1]
## [1] "pt03_Media"
The square bracket operator is often used with logical vectors
(TRUE/FALSE) to subset data. For example, we can subset our metadata to
all MEDIA observations (rows).
# Create logical vector for which lib.size values are > 10 million
logical.vector <- meta$condition == "MEDIA"
#View vector
logical.vector
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#Apply vector to data frame to select only observations where the logical vector is TRUE
meta[logical.vector, ]
## [1] libID ptID condition age_dys sex ptID_old
## [7] RNAseq methylation total_seq
## <0 rows> (or 0-length row.names)
Subsetting is extremely useful when working with large data. We will learn more complex subsets on day 2 using the tidyverse. But first…
| Statement | Meaning |
|---|---|
<- |
Assign to object in environment |
== |
Equal to |
!= |
Not equal to |
> |
Greater than |
>= |
Greater than or equal to |
< |
Less than |
<= |
Less than or equal to |
%in% |
In or within |
is.na() |
Is missing, e.g NA |
!is.na() |
Is not missing |
& |
And |
| |
Or |
Using the meta data frame:
norm.factors variable is.[]:
RSID value occurs in the 20th rowlib.size equals 14509963RSID equals
“RS102521” or “RS102484”. Hint: Use a logical vector.The tidyverse is not just one package, but a collection of several packages that are designed to work well together for data analysis tasks. Let’s load the tidyverse and read in the metadata again.
For more information on installation, see the setup instructions.
library(tidyverse)
meta <- read_csv("data/metadata.csv")
## Rows: 20 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): libID, ptID, condition, sex, ptID_old
## dbl (2): age_dys, total_seq
## lgl (2): RNAseq, methylation
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Take a look at the dataset with View():
View(meta)
Here is a brief description of the variables in these data.
libID (character): Unique library ID for each
sequencing library. Formatted as ptID_conditionptID (character): Patient ID. Each patient has 2
libraries which differ in their conditioncondition (character): Experimental condition. Media
for media control. Mtb of M. tuberculosis infected.age_dys (numeric): Age in agessex (character): Biological sex, M or FptID_old (character): Old patient ID with leading
zeroesRNAseq (logical): If the library has pass-filter
RNA-seq datamethylation (logical): If the library has pass-filter
methylation datatotal_seq (numeric): Total number of pass-filter
sequencesYou can use the select function to keep only the columns
that you want and remove columns that you don’t need.
select(meta, ptID)
## # A tibble: 20 × 1
## ptID
## <chr>
## 1 pt01
## 2 pt01
## 3 pt02
## 4 pt02
## 5 pt03
## 6 pt03
## 7 pt04
## 8 pt04
## 9 pt05
## 10 pt05
## 11 pt06
## 12 pt06
## 13 pt07
## 14 pt07
## 15 pt08
## 16 pt08
## 17 pt09
## 18 pt09
## 19 pt10
## 20 pt10
The first argument to the select function is our data
frame, and the second argument is the name of the column we want to
keep. We can continue to list multiple columns to keep by separating
with a comma.
select(meta, ptID, total_seq)
## # A tibble: 20 × 2
## ptID total_seq
## <chr> <dbl>
## 1 pt01 9114402.
## 2 pt01 8918699.
## 3 pt02 9221555.
## 4 pt02 7733260.
## 5 pt03 6231728.
## 6 pt03 7105193.
## 7 pt04 10205557.
## 8 pt04 8413543.
## 9 pt05 15536685.
## 10 pt05 15509446.
## 11 pt06 7085995.
## 12 pt06 6588160.
## 13 pt07 10706098.
## 14 pt07 8576245.
## 15 pt08 9957906.
## 16 pt08 8220348.
## 17 pt09 13055276
## 18 pt09 13800442.
## 19 pt10 8216706.
## 20 pt10 7599609.
If you want to keep all but one column, put the minus
(-) sign in front of it. I like to think of it as
“subtracting” the column from the data frame.
select(meta, -ptID_old)
## # A tibble: 20 × 8
## libID ptID condition age_dys sex RNAseq methylation total_seq
## <chr> <chr> <chr> <dbl> <chr> <lgl> <lgl> <dbl>
## 1 pt01_Media pt01 Media 12410 M TRUE FALSE 9114402.
## 2 pt01_Mtb pt01 Mtb 12410 M TRUE FALSE 8918699.
## 3 pt02_Media pt02 Media 12775 M TRUE FALSE 9221555.
## 4 pt02_Mtb pt02 Mtb 12775 M TRUE FALSE 7733260.
## 5 pt03_Media pt03 Media 11315 M TRUE FALSE 6231728.
## 6 pt03_Mtb pt03 Mtb 11315 M TRUE FALSE 7105193.
## 7 pt04_Media pt04 Media 8395 M TRUE TRUE 10205557.
## 8 pt04_Mtb pt04 Mtb 8395 M TRUE TRUE 8413543.
## 9 pt05_Media pt05 Media 7300 M TRUE FALSE 15536685.
## 10 pt05_Mtb pt05 Mtb 7300 M TRUE FALSE 15509446.
## 11 pt06_Media pt06 Media 6570 F TRUE FALSE 7085995.
## 12 pt06_Mtb pt06 Mtb 6570 F TRUE FALSE 6588160.
## 13 pt07_Media pt07 Media 7665 F TRUE FALSE 10706098.
## 14 pt07_Mtb pt07 Mtb 7665 F TRUE FALSE 8576245.
## 15 pt08_Media pt08 Media 8760 M TRUE FALSE 9957906.
## 16 pt08_Mtb pt08 Mtb 8760 M TRUE FALSE 8220348.
## 17 pt09_Media pt09 Media 6935 M TRUE FALSE 13055276
## 18 pt09_Mtb pt09 Mtb 6935 M TRUE FALSE 13800442.
## 19 pt10_Media pt10 Media 8030 F TRUE FALSE 8216706.
## 20 pt10_Mtb pt10 Mtb 8030 F TRUE FALSE 7599609.
You can use filter to remove/keep rows. Instead of
specifying a column name as in select, you provide a
statement that gives TRUE/FALSE and filter keeps all the
rows for which that statement is TRUE. The conditional
operators we saw in Intro R will help with this!
Let’s filter to just the Media samples.
filter(meta, condition == "Media")
## # A tibble: 10 × 9
## libID ptID condition age_dys sex ptID_old RNAseq methylation total_seq
## <chr> <chr> <chr> <dbl> <chr> <chr> <lgl> <lgl> <dbl>
## 1 pt01_Med… pt01 Media 12410 M pt00001 TRUE FALSE 9114402.
## 2 pt02_Med… pt02 Media 12775 M pt00002 TRUE FALSE 9221555.
## 3 pt03_Med… pt03 Media 11315 M pt00003 TRUE FALSE 6231728.
## 4 pt04_Med… pt04 Media 8395 M pt00004 TRUE TRUE 10205557.
## 5 pt05_Med… pt05 Media 7300 M pt00005 TRUE FALSE 15536685.
## 6 pt06_Med… pt06 Media 6570 F pt00006 TRUE FALSE 7085995.
## 7 pt07_Med… pt07 Media 7665 F pt00007 TRUE FALSE 10706098.
## 8 pt08_Med… pt08 Media 8760 M pt00008 TRUE FALSE 9957906.
## 9 pt09_Med… pt09 Media 6935 M pt00009 TRUE FALSE 13055276
## 10 pt10_Med… pt10 Media 8030 F pt00010 TRUE FALSE 8216706.
You can string together multiple statements and ask if both/all are
TRUE with and & or if at least one is TRUE with or
|. Notice that the & statement removes all
rows because no library has both a Media and Mtb condition. In contrast,
the | statement keeps all rows because all libraries are
either the Media or Mtb condition.
filter(meta, condition == "Media" & condition == "Mtb")
## # A tibble: 0 × 9
## # … with 9 variables: libID <chr>, ptID <chr>, condition <chr>, age_dys <dbl>,
## # sex <chr>, ptID_old <chr>, RNAseq <lgl>, methylation <lgl>, total_seq <dbl>
filter(meta, condition == "Media" | condition == "Mtb")
## # A tibble: 20 × 9
## libID ptID condition age_dys sex ptID_old RNAseq methylation total_seq
## <chr> <chr> <chr> <dbl> <chr> <chr> <lgl> <lgl> <dbl>
## 1 pt01_Med… pt01 Media 12410 M pt00001 TRUE FALSE 9114402.
## 2 pt01_Mtb pt01 Mtb 12410 M pt00001 TRUE FALSE 8918699.
## 3 pt02_Med… pt02 Media 12775 M pt00002 TRUE FALSE 9221555.
## 4 pt02_Mtb pt02 Mtb 12775 M pt00002 TRUE FALSE 7733260.
## 5 pt03_Med… pt03 Media 11315 M pt00003 TRUE FALSE 6231728.
## 6 pt03_Mtb pt03 Mtb 11315 M pt00003 TRUE FALSE 7105193.
## 7 pt04_Med… pt04 Media 8395 M pt00004 TRUE TRUE 10205557.
## 8 pt04_Mtb pt04 Mtb 8395 M pt00004 TRUE TRUE 8413543.
## 9 pt05_Med… pt05 Media 7300 M pt00005 TRUE FALSE 15536685.
## 10 pt05_Mtb pt05 Mtb 7300 M pt00005 TRUE FALSE 15509446.
## 11 pt06_Med… pt06 Media 6570 F pt00006 TRUE FALSE 7085995.
## 12 pt06_Mtb pt06 Mtb 6570 F pt00006 TRUE FALSE 6588160.
## 13 pt07_Med… pt07 Media 7665 F pt00007 TRUE FALSE 10706098.
## 14 pt07_Mtb pt07 Mtb 7665 F pt00007 TRUE FALSE 8576245.
## 15 pt08_Med… pt08 Media 8760 M pt00008 TRUE FALSE 9957906.
## 16 pt08_Mtb pt08 Mtb 8760 M pt00008 TRUE FALSE 8220348.
## 17 pt09_Med… pt09 Media 6935 M pt00009 TRUE FALSE 13055276
## 18 pt09_Mtb pt09 Mtb 6935 M pt00009 TRUE FALSE 13800442.
## 19 pt10_Med… pt10 Media 8030 F pt00010 TRUE FALSE 8216706.
## 20 pt10_Mtb pt10 Mtb 8030 F pt00010 TRUE FALSE 7599609.
Quotes vs No Quotes
Notice that we put the word “Media” inside quotes. This is because we are not using a column from inside our data frame. When you need to include actual text values in R, they will be placed inside quotes to tell them apart from other object or variable names.
The general rule is that if you want to use values from the columns of your data object, then you supply the name of the column without quotes, but if you want to specify a value that does not come from your data, then use quotes.
Sometimes your variable names are not ideal. rename does
just want it says; it renames columns in your data. The syntax for
rename is to give the new name first and then tell R what
that new name should replace, so new_name = old_name.
Here, were rename condition to infection as this better represents those
data.
rename(meta, infection = condition)
## # A tibble: 20 × 9
## libID ptID infection age_dys sex ptID_old RNAseq methylation total_seq
## <chr> <chr> <chr> <dbl> <chr> <chr> <lgl> <lgl> <dbl>
## 1 pt01_Med… pt01 Media 12410 M pt00001 TRUE FALSE 9114402.
## 2 pt01_Mtb pt01 Mtb 12410 M pt00001 TRUE FALSE 8918699.
## 3 pt02_Med… pt02 Media 12775 M pt00002 TRUE FALSE 9221555.
## 4 pt02_Mtb pt02 Mtb 12775 M pt00002 TRUE FALSE 7733260.
## 5 pt03_Med… pt03 Media 11315 M pt00003 TRUE FALSE 6231728.
## 6 pt03_Mtb pt03 Mtb 11315 M pt00003 TRUE FALSE 7105193.
## 7 pt04_Med… pt04 Media 8395 M pt00004 TRUE TRUE 10205557.
## 8 pt04_Mtb pt04 Mtb 8395 M pt00004 TRUE TRUE 8413543.
## 9 pt05_Med… pt05 Media 7300 M pt00005 TRUE FALSE 15536685.
## 10 pt05_Mtb pt05 Mtb 7300 M pt00005 TRUE FALSE 15509446.
## 11 pt06_Med… pt06 Media 6570 F pt00006 TRUE FALSE 7085995.
## 12 pt06_Mtb pt06 Mtb 6570 F pt00006 TRUE FALSE 6588160.
## 13 pt07_Med… pt07 Media 7665 F pt00007 TRUE FALSE 10706098.
## 14 pt07_Mtb pt07 Mtb 7665 F pt00007 TRUE FALSE 8576245.
## 15 pt08_Med… pt08 Media 8760 M pt00008 TRUE FALSE 9957906.
## 16 pt08_Mtb pt08 Mtb 8760 M pt00008 TRUE FALSE 8220348.
## 17 pt09_Med… pt09 Media 6935 M pt00009 TRUE FALSE 13055276
## 18 pt09_Mtb pt09 Mtb 6935 M pt00009 TRUE FALSE 13800442.
## 19 pt10_Med… pt10 Media 8030 F pt00010 TRUE FALSE 8216706.
## 20 pt10_Mtb pt10 Mtb 8030 F pt00010 TRUE FALSE 7599609.
The function summarize() performs summary statistics on
your data. We can use it to find the mean, median, or other statistics
about the dataset. Let’s find the average age in days below:
summarize(meta, mean(age_dys))
## # A tibble: 1 × 1
## `mean(age_dys)`
## <dbl>
## 1 9016.
Notice that summarize automatically names the column
with the summary statistic based on the function used to calculate that
statistic. This is not ideal as the formatting has parentheses. So we
can specify a name instead.
summarize(meta, mean_age_dys = mean(age_dys))
## # A tibble: 1 × 1
## mean_age_dys
## <dbl>
## 1 9016.
Instead of including the data as an argument, we can use the pipe
operator %>% to pass the data value into the
summarize function.
meta %>% summarize(mean_age_dys = mean(age_dys))
## # A tibble: 1 × 1
## mean_age_dys
## <dbl>
## 1 9016.
This line of code will do the exact same thing as our previous
summary command, but the piping function tells R to use the
meta data frame as the first argument in the next
function.
This lets us “chain” together multiple functions. The pipe
(%>%) takes the output from the left side and use it as
input to the right side.
We can also add an
meta %>%
summarize(mean_age_days = mean(age_dys))
## # A tibble: 1 × 1
## mean_age_days
## <dbl>
## 1 9016.
Using the pipe operator %>% and enter
command makes our code more readable. The pipe operator
%>% also helps to avoid using nested function and
minimizes the need for new variables.
Since we use the pipe operator so often, there is a keyboard shortcut
for it in RStudio. You can press
The comes in handy with functions like group_by. If we
want to calculate the mean age in days for males and females separately,
we can use group_by before summarize:
meta %>%
group_by(sex) %>%
summarize(mean_age_days = mean(age_dys))
## # A tibble: 2 × 2
## sex mean_age_days
## <chr> <dbl>
## 1 F 7422.
## 2 M 9699.
And we could merge this with a filter to only look at the Media libraries, since each age is repeated for that patient’s Media and Mtb libraries. The results is the same in this case but filtering would be important if any of our patient’s were missing one of the conditions.
meta %>%
filter(condition == "Media") %>%
group_by(sex) %>%
summarize(mean_age_days = mean(age_dys))
## # A tibble: 2 × 2
## sex mean_age_days
## <chr> <dbl>
## 1 F 7422.
## 2 M 9699.
Sometimes we would like to create a new column based on the values in an existing one. For example, age is days is a little difficult to understand. Let’s add a column with age in years.
meta %>%
mutate(age_yrs = age_dys / 365.25)
## # A tibble: 20 × 10
## libID ptID condition age_dys sex ptID_old RNAseq methylation total_seq
## <chr> <chr> <chr> <dbl> <chr> <chr> <lgl> <lgl> <dbl>
## 1 pt01_Med… pt01 Media 12410 M pt00001 TRUE FALSE 9114402.
## 2 pt01_Mtb pt01 Mtb 12410 M pt00001 TRUE FALSE 8918699.
## 3 pt02_Med… pt02 Media 12775 M pt00002 TRUE FALSE 9221555.
## 4 pt02_Mtb pt02 Mtb 12775 M pt00002 TRUE FALSE 7733260.
## 5 pt03_Med… pt03 Media 11315 M pt00003 TRUE FALSE 6231728.
## 6 pt03_Mtb pt03 Mtb 11315 M pt00003 TRUE FALSE 7105193.
## 7 pt04_Med… pt04 Media 8395 M pt00004 TRUE TRUE 10205557.
## 8 pt04_Mtb pt04 Mtb 8395 M pt00004 TRUE TRUE 8413543.
## 9 pt05_Med… pt05 Media 7300 M pt00005 TRUE FALSE 15536685.
## 10 pt05_Mtb pt05 Mtb 7300 M pt00005 TRUE FALSE 15509446.
## 11 pt06_Med… pt06 Media 6570 F pt00006 TRUE FALSE 7085995.
## 12 pt06_Mtb pt06 Mtb 6570 F pt00006 TRUE FALSE 6588160.
## 13 pt07_Med… pt07 Media 7665 F pt00007 TRUE FALSE 10706098.
## 14 pt07_Mtb pt07 Mtb 7665 F pt00007 TRUE FALSE 8576245.
## 15 pt08_Med… pt08 Media 8760 M pt00008 TRUE FALSE 9957906.
## 16 pt08_Mtb pt08 Mtb 8760 M pt00008 TRUE FALSE 8220348.
## 17 pt09_Med… pt09 Media 6935 M pt00009 TRUE FALSE 13055276
## 18 pt09_Mtb pt09 Mtb 6935 M pt00009 TRUE FALSE 13800442.
## 19 pt10_Med… pt10 Media 8030 F pt00010 TRUE FALSE 8216706.
## 20 pt10_Mtb pt10 Mtb 8030 F pt00010 TRUE FALSE 7599609.
## # … with 1 more variable: age_yrs <dbl>
Note that this new column is not saved in meta because we printed the
result to the console and did not save it to the environment. We can
overwrite the meta data frame so it will now contain the
age_yrs column.
meta <- meta %>%
mutate(age_yrs = age_dys / 365.25)
meta
## # A tibble: 20 × 10
## libID ptID condition age_dys sex ptID_old RNAseq methylation total_seq
## <chr> <chr> <chr> <dbl> <chr> <chr> <lgl> <lgl> <dbl>
## 1 pt01_Med… pt01 Media 12410 M pt00001 TRUE FALSE 9114402.
## 2 pt01_Mtb pt01 Mtb 12410 M pt00001 TRUE FALSE 8918699.
## 3 pt02_Med… pt02 Media 12775 M pt00002 TRUE FALSE 9221555.
## 4 pt02_Mtb pt02 Mtb 12775 M pt00002 TRUE FALSE 7733260.
## 5 pt03_Med… pt03 Media 11315 M pt00003 TRUE FALSE 6231728.
## 6 pt03_Mtb pt03 Mtb 11315 M pt00003 TRUE FALSE 7105193.
## 7 pt04_Med… pt04 Media 8395 M pt00004 TRUE TRUE 10205557.
## 8 pt04_Mtb pt04 Mtb 8395 M pt00004 TRUE TRUE 8413543.
## 9 pt05_Med… pt05 Media 7300 M pt00005 TRUE FALSE 15536685.
## 10 pt05_Mtb pt05 Mtb 7300 M pt00005 TRUE FALSE 15509446.
## 11 pt06_Med… pt06 Media 6570 F pt00006 TRUE FALSE 7085995.
## 12 pt06_Mtb pt06 Mtb 6570 F pt00006 TRUE FALSE 6588160.
## 13 pt07_Med… pt07 Media 7665 F pt00007 TRUE FALSE 10706098.
## 14 pt07_Mtb pt07 Mtb 7665 F pt00007 TRUE FALSE 8576245.
## 15 pt08_Med… pt08 Media 8760 M pt00008 TRUE FALSE 9957906.
## 16 pt08_Mtb pt08 Mtb 8760 M pt00008 TRUE FALSE 8220348.
## 17 pt09_Med… pt09 Media 6935 M pt00009 TRUE FALSE 13055276
## 18 pt09_Mtb pt09 Mtb 6935 M pt00009 TRUE FALSE 13800442.
## 19 pt10_Med… pt10 Media 8030 F pt00010 TRUE FALSE 8216706.
## 20 pt10_Mtb pt10 Mtb 8030 F pt00010 TRUE FALSE 7599609.
## # … with 1 more variable: age_yrs <dbl>
total_seq_millions based on
total_seq.condition == "Media"?Data comes in many shapes and sizes, and one way we classify data is
either “wide” or “long.” Data that is “long” has one row per
observation. The metadata is in a long format. We have one row for each
patient sample and each variable is in a different column. We might also
describe these data as “tidy” because it makes it easy to work with
ggplot2 and dplyr functions (this is where the
“tidy” in “tidyverse” comes from). As tidy as it may be, sometimes we
may want our data in a “wide” format. Typically, in “wide” format, each
row represents a group of observations and each value is placed in a
different column rather than a different row. For example, maybe we want
each patient to have 1 row and their Media/Mtb data to be spread across
columns.
The tidyr package in the tidyverse contains the
functions pivot_wider and pivot_longer that
make it easy to switch between the two formats.
For each patient, we have two samples: Media and
Mtb. In the metadata, the only difference in these two
conditions is the libID column (which is redundant with the
ptID and condition) and the
total_seq column. We can take the condition column and
create two new columns, one with the total seqs in the media sample and
one with the total seqs in the mtb sample. This is called “pivoting” the
data frame “wider” because we rearrange it by creating an additional
column and making it have fewer rows. Let’s try it!
meta %>%
select(-libID) %>%
pivot_wider(names_from = condition, values_from = total_seq)
## # A tibble: 10 × 9
## ptID age_dys sex ptID_old RNAseq methylation age_yrs Media Mtb
## <chr> <dbl> <chr> <chr> <lgl> <lgl> <dbl> <dbl> <dbl>
## 1 pt01 12410 M pt00001 TRUE FALSE 34.0 9114402. 8918699.
## 2 pt02 12775 M pt00002 TRUE FALSE 35.0 9221555. 7733260.
## 3 pt03 11315 M pt00003 TRUE FALSE 31.0 6231728. 7105193.
## 4 pt04 8395 M pt00004 TRUE TRUE 23.0 10205557. 8413543.
## 5 pt05 7300 M pt00005 TRUE FALSE 20.0 15536685. 15509446.
## 6 pt06 6570 F pt00006 TRUE FALSE 18.0 7085995. 6588160.
## 7 pt07 7665 F pt00007 TRUE FALSE 21.0 10706098. 8576245.
## 8 pt08 8760 M pt00008 TRUE FALSE 24.0 9957906. 8220348.
## 9 pt09 6935 M pt00009 TRUE FALSE 19.0 13055276 13800442.
## 10 pt10 8030 F pt00010 TRUE FALSE 22.0 8216706. 7599609.
Notice here that we tell pivot_wider() which columns to
pull the names we wish our new columns to be named from the year
variable, and the values to populate those columns from the condition
variable. (Again, neither of which have to be in quotes in the code when
there are no special characters or spaces - certainly an incentive not
to use special characters or spaces!) We see that the resulting table
has new columns by year, and the values populate it with our remaining
variables dictating the rows.
Maybe we should assign those columns more informative names:
meta %>%
select(-libID) %>%
pivot_wider(names_from = condition, values_from = total_seq,
names_prefix = "total_seq_")
## # A tibble: 10 × 9
## ptID age_dys sex ptID_old RNAseq methylation age_yrs total_seq_Media
## <chr> <dbl> <chr> <chr> <lgl> <lgl> <dbl> <dbl>
## 1 pt01 12410 M pt00001 TRUE FALSE 34.0 9114402.
## 2 pt02 12775 M pt00002 TRUE FALSE 35.0 9221555.
## 3 pt03 11315 M pt00003 TRUE FALSE 31.0 6231728.
## 4 pt04 8395 M pt00004 TRUE TRUE 23.0 10205557.
## 5 pt05 7300 M pt00005 TRUE FALSE 20.0 15536685.
## 6 pt06 6570 F pt00006 TRUE FALSE 18.0 7085995.
## 7 pt07 7665 F pt00007 TRUE FALSE 21.0 10706098.
## 8 pt08 8760 M pt00008 TRUE FALSE 24.0 9957906.
## 9 pt09 6935 M pt00009 TRUE FALSE 19.0 13055276
## 10 pt10 8030 F pt00010 TRUE FALSE 22.0 8216706.
## # … with 1 more variable: total_seq_Mtb <dbl>
And now let’s save the new wider metadata to a new variable:
meta_wide <- meta %>%
select(-libID) %>%
pivot_wider(names_from = condition, values_from = total_seq)
Notice that the number of rows and columns has changed:
nrow(meta)
## [1] 20
ncol(meta)
## [1] 10
nrow(meta_wide)
## [1] 10
ncol(meta_wide)
## [1] 9
Everything we did with pivot_wider, we can reverse with
pivot_longer. Maybe you receive a data frame in a wide
format, but you need it in a long format. Here’s how we can get back to
a long data frame:
meta_wide %>%
pivot_longer(c(Media, Mtb),
names_to = 'condition', values_to = 'total_seq')
## # A tibble: 20 × 9
## ptID age_dys sex ptID_old RNAseq methylation age_yrs condition total_seq
## <chr> <dbl> <chr> <chr> <lgl> <lgl> <dbl> <chr> <dbl>
## 1 pt01 12410 M pt00001 TRUE FALSE 34.0 Media 9114402.
## 2 pt01 12410 M pt00001 TRUE FALSE 34.0 Mtb 8918699.
## 3 pt02 12775 M pt00002 TRUE FALSE 35.0 Media 9221555.
## 4 pt02 12775 M pt00002 TRUE FALSE 35.0 Mtb 7733260.
## 5 pt03 11315 M pt00003 TRUE FALSE 31.0 Media 6231728.
## 6 pt03 11315 M pt00003 TRUE FALSE 31.0 Mtb 7105193.
## 7 pt04 8395 M pt00004 TRUE TRUE 23.0 Media 10205557.
## 8 pt04 8395 M pt00004 TRUE TRUE 23.0 Mtb 8413543.
## 9 pt05 7300 M pt00005 TRUE FALSE 20.0 Media 15536685.
## 10 pt05 7300 M pt00005 TRUE FALSE 20.0 Mtb 15509446.
## 11 pt06 6570 F pt00006 TRUE FALSE 18.0 Media 7085995.
## 12 pt06 6570 F pt00006 TRUE FALSE 18.0 Mtb 6588160.
## 13 pt07 7665 F pt00007 TRUE FALSE 21.0 Media 10706098.
## 14 pt07 7665 F pt00007 TRUE FALSE 21.0 Mtb 8576245.
## 15 pt08 8760 M pt00008 TRUE FALSE 24.0 Media 9957906.
## 16 pt08 8760 M pt00008 TRUE FALSE 24.0 Mtb 8220348.
## 17 pt09 6935 M pt00009 TRUE FALSE 19.0 Media 13055276
## 18 pt09 6935 M pt00009 TRUE FALSE 19.0 Mtb 13800442.
## 19 pt10 8030 F pt00010 TRUE FALSE 22.0 Media 8216706.
## 20 pt10 8030 F pt00010 TRUE FALSE 22.0 Mtb 7599609.
Often not all your data is pre-compiled in one nice, tidy data frame. You may have multiple files, batches, etc to work with. dplyr’s join functions allow use to combined data frame without fear of copy-paste errors or missing anything!
So far, we’ve been working with just the metadata - a single data
frame. Let’s adventure into the full RNA-seq dat object so
we have multiple data frames to combine.
load("data/dat_voom.RData")
First, we have the normalized log2 counts per million (CPM) in
E
dat$E[1:3,1:3]
## pt01_Media pt01_Mtb pt02_Media
## A1BG 1.620022 1.717603 1.086946
## A2M 7.259912 6.680758 5.939909
## A2ML1 -4.052404 -3.515057 -4.015712
Second, we have the same metadata as our meta object in
the targets data.
dat$targets[1:3,]
## group lib.size norm.factors libID ptID condition age_dys sex
## pt01_Media 1 8295928 1.092872 pt01_Media pt01 Media 12410 M
## pt01_Mtb 1 5716203 0.819490 pt01_Mtb pt01 Mtb 12410 M
## pt02_Media 1 8087601 1.119352 pt02_Media pt02 Media 12775 M
## ptID_old RNAseq methylation total_seq sample.weights
## pt01_Media pt00001 TRUE FALSE 9114402 0.8943797
## pt01_Mtb pt00001 TRUE FALSE 8918699 0.9850363
## pt02_Media pt00002 TRUE FALSE 9221555 1.2582092
Looking at these two data frames, they do not share any columns - thus there is nothing to join them with. However, notice that the column names in the counts data are the libIDs from the metadata! We can use pivot to get the joining column we need.
Notice that since E is a matrix with rownames, we must
force it to be a data frame and move the rownames into a dat column.
E_long <- as.data.frame(dat$E) %>%
rownames_to_column("gene") %>%
pivot_longer(-gene, names_to = "libID", values_to = "log2CPM")
E_long
## # A tibble: 268,380 × 3
## gene libID log2CPM
## <chr> <chr> <dbl>
## 1 A1BG pt01_Media 1.62
## 2 A1BG pt01_Mtb 1.72
## 3 A1BG pt02_Media 1.09
## 4 A1BG pt02_Mtb 1.10
## 5 A1BG pt03_Media 0.874
## 6 A1BG pt03_Mtb 1.71
## 7 A1BG pt04_Media 1.57
## 8 A1BG pt04_Mtb -0.264
## 9 A1BG pt05_Media 1.63
## 10 A1BG pt05_Mtb -0.0305
## # … with 268,370 more rows
Now both data frames have a libID column!
Let’s combine our sequence counts with our metadata. To combine two
data frames, we will use a join function. The dplyr package
has a number of tools for joining data frames together depending on what
we want to do with the rows of the data of countries that are not
represented in both data frames.
?join
We have a few options here:
The mutating joins add columns from y to x, matching rows based on the keys: - inner_join(): includes all rows in x and y. - left_join(): includes all rows in x. - right_join(): includes all rows in y. - full_join(): includes all rows in x or y.
In an “inner join”, the new data frame only has those rows where the same key is found in both data frames. This is a very commonly used join.
full_data <- inner_join(meta, E_long, by = 'libID')
inner_join the long expression data E_long
with metadata containing only Media samples.full_join?I find this page helpful when working with joins. https://stat545.com/join-cheatsheet.html
ggplot2 is the tidyverse’s main plotting package. Full documentation is available at docs.ggplot2.org
We load ggplot within the tidyverse.
library(tidyverse)
ggplot2 is an implementation of Grammar of Graphics (Wilkinson 1999) for R
Benefits:
Drawbacks:
ggplot functions by adding layers to a plot, somewhat
like how dplyr/tidyr sequentially modify data with the pipe
%>%. Instead, ggplot uses the + to combine
layers such as:
data.frame) of variablesThe idea is to independently specify and combine layers to create the plot you want. There are at least three things we have to specify to create any plot:
A quick reminder about our data. There are bulk RNA-seq contained in
the dat object in file data/dat_voom.RData. Normalized log2
counts per million (CPM) are contained in dat$E and library
metadata are in dat$targets. You will not need the other
pieces of dat for this workshop.
load("data/dat_voom.RData")
dat$E[1:3,1:3]
## pt01_Media pt01_Mtb pt02_Media
## A1BG 1.620022 1.717603 1.086946
## A2M 7.259912 6.680758 5.939909
## A2ML1 -4.052404 -3.515057 -4.015712
dat$targets[1:3,]
## group lib.size norm.factors libID ptID condition age_dys sex
## pt01_Media 1 8295928 1.092872 pt01_Media pt01 Media 12410 M
## pt01_Mtb 1 5716203 0.819490 pt01_Mtb pt01 Mtb 12410 M
## pt02_Media 1 8087601 1.119352 pt02_Media pt02 Media 12775 M
## ptID_old RNAseq methylation total_seq sample.weights
## pt01_Media pt00001 TRUE FALSE 9114402 0.8943797
## pt01_Mtb pt00001 TRUE FALSE 8918699 0.9850363
## pt02_Media pt00002 TRUE FALSE 9221555 1.2582092
Here is a brief description of the variables in the metadata.
libID (character): Unique library ID for each
sequencing library. Formatted as ptID_conditionptID (character): Patient ID. Each patient has 2
libraries which differ in their conditioncondition (character): Experimental condition. Media
for media control. Mtb of M. tuberculosis infected.age_dys (numeric): Age in agessex (character): Biological sex, M or FptID_old (character): Old patient ID with leading
zeroesRNAseq (logical): If the library has pass-filter
RNA-seq datamethylation (logical): If the library has pass-filter
methylation datatotal_seq (numeric): Total number of pass-filter
sequencesWe will start with a boxplot of the total sequences per library. First, we call a ggplot object. Since we’ve given it no data, this is a blank plot.
ggplot()
Then, we give the plot data and specify aesthetics for which parts of the data we want to plot. Now, we see that ggplot knows what the scale of the y data will be.
ggplot(dat$targets) +
aes(y = total_seq) #<--- see change
Finally, we add a geom to denote how we want the data displayed, in this case as a boxplot.
ggplot(dat$targets) +
aes(y = total_seq) +
geom_boxplot() #<--- see change
Since we did not specify an x-variable, all the data are plotted in one boxplot. We add an aesthetic to specify x and split up the data.
ggplot(dat$targets) +
aes(y = total_seq, x = condition) + #<--- see change
geom_boxplot()
These same data could be displayed as a dotplot instead.
ggplot(dat$targets) +
aes(y = total_seq, x = condition) +
geom_point() #<--- see change
And if we want to make sure we see every point, a jitter works nicely.
ggplot(dat$targets) +
aes(y = total_seq, x = condition) +
geom_jitter() #<--- see change
The building block nature of ggplot allows us to add multiple geoms on top of each other. So we can combine the two previous plots like so
ggplot(dat$targets) +
aes(y = total_seq, x = condition) +
geom_boxplot() + #<--- see change
geom_jitter() #<--- see change
But you may notice that there are some duplicate points. Looking back at the first boxplots, there are points for points outside the box. These points remain in the above plot plus appear jittered from the second geom. We can fix this by telling the boxplot to not plot the outliers.
And while this is not an issue in this plot (since there are no
outliers), you should also remove the outliers plotted by
geom_boxplot or else there will be duplicate points with
geom_jitter.
ggplot(dat$targets) +
aes(y = total_seq, x = condition) +
geom_boxplot(outlier.shape = NA) + #<--- see change
geom_jitter()
Because ggplot works in layers, the order matters. If for example we
put the boxplot after the jitter, the boxes cover up points! We use
alpha for transparency so you can see these points. Thus,
we careful when considering how you setup your plot!
ggplot(dat$targets) +
aes(y = total_seq, x = condition) +
geom_jitter() + #<--- see change
geom_boxplot(outlier.shape = NA, alpha = 0.8) #<--- see change
With jitter, you are adding random error to be able to see all the points. Because this is truly random, every time you run jitter, you will get a slightly different plot. To combat this, set a seed before making the plot.
set.seed(468) #<--- see change
ggplot(dat$targets) +
aes(y = total_seq, x = condition) +
geom_boxplot(outlier.shape = NA) +
geom_jitter()
set.seed(468) #<--- see change
ggplot(dat$targets) +
aes(y = total_seq, x = condition) +
geom_boxplot(outlier.shape = NA) +
geom_jitter()
One dotplot of interest in RNA-seq analysis is principle component analysis (PCA). In this plot, each RNA-seq library is a single point. The axes are unitless but in general, points that are closer together represent libraries with more similar gene expression. You can see large scale changes in gene expression in PCA. However, even when you don’t see clear PCA groupings, there can be significant differentially expressed genes. So don’t despair if your data don’t cluster in this plot!
Here, we calculate PCA from the normalized counts. Note that we
transpose t() the counts data to get one PCA point per
library, instead of one per gene.
dat.pca <- prcomp(t(dat$E), scale.=TRUE, center=TRUE)
dat.pca$x[1:3,1:3]
## PC1 PC2 PC3
## pt01_Media -103.91401 -33.55704 22.40058
## pt01_Mtb 52.95006 -74.68146 24.78278
## pt02_Media -60.61507 27.13460 24.58148
Then we use join to add metadata to the PCA data, so we can annotate the PCA plot.
dat.pca.meta <- as.data.frame(dat.pca$x) %>%
rownames_to_column("libID") %>%
inner_join(dat$targets)
## Joining, by = "libID"
A nice trick with joins is to not specify what to join
by. In this case, the function finds all columns shared
between the data frames and uses them to match rows.
With geom_point, we have a PCA!
ggplot(dat.pca.meta) +
aes(x = PC1, y = PC2) +
geom_point()
But it’s not very informative, so let’s annotate with our condition groups. In general, you want to use annotations in the following order: space, color, shape, size. These are ordered by how easily the human eye can discern differences.
We already have space as libraries are separated along x and y. Now we add color.
ggplot(dat.pca.meta) +
aes(x = PC1, y = PC2, shape = condition) + #<--- see change
geom_point()
I like to keep all my aesthetics in one layer, but you could keep
adding aes() functions to the plot if you prefer.
We could also use shape for the same data. Which plot to you think is easier to interpret?
ggplot(dat.pca.meta) +
aes(x = PC1, y = PC2, shape = condition) + #<--- see change
geom_point()
We’re not showing size or transparency because these aesthetics are difficult to see and should be avoided unless you really need them or the differences between data points are large.
We can also improve our axes and legend labels. First we see what proportion of variation is explained by each PC.
summary(dat.pca)$importance
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 57.21487 38.99933 35.32881 34.21306 29.76604 27.13699
## Proportion of Variance 0.24395 0.11334 0.09301 0.08723 0.06603 0.05488
## Cumulative Proportion 0.24395 0.35729 0.45030 0.53753 0.60356 0.65844
## PC7 PC8 PC9 PC10 PC11 PC12
## Standard deviation 24.48881 23.98672 21.83296 21.23941 19.84912 18.33097
## Proportion of Variance 0.04469 0.04288 0.03552 0.03362 0.02936 0.02504
## Cumulative Proportion 0.70313 0.74601 0.78153 0.81515 0.84451 0.86955
## PC13 PC14 PC15 PC16 PC17 PC18
## Standard deviation 18.05337 17.40958 17.13667 15.67883 14.48455 14.12701
## Proportion of Variance 0.02429 0.02259 0.02188 0.01832 0.01563 0.01487
## Cumulative Proportion 0.89384 0.91642 0.93831 0.95663 0.97226 0.98713
## PC19 PC20
## Standard deviation 13.14034 1.137607e-13
## Proportion of Variance 0.01287 0.000000e+00
## Cumulative Proportion 1.00000 1.000000e+00
And use that for labels as well as change the legend.
ggplot(dat.pca.meta) +
aes(x = PC1, y = PC2, color = condition) +
geom_point() +
labs(x = "PC1 (24%)", y = "PC2 (11%)", #<--- see change
color = "Infection")
ggplot has a number of themes that change the aesthetics
of the plot. I like theme_classic for PCA. You can also
checkout more themes in the package ggthemes
ggplot(dat.pca.meta) +
aes(x = PC1, y = PC2, color = condition) +
geom_point() +
labs(x = "PC1 (24%)", y = "PC2 (11%)",
color = "Infection") +
theme_classic() #<--- see change
Let’s move on to another application of geom_point. A
volcano plot shows differentially expressed gene significance and fold
change.
We’ll need to perform linear modeling to determine these results.
Below is the workflow for comparing Mtb-infected vs media samples,
corrected for sex and blocked by patient. We use our R package
[kimma][https://github.com/BIGslu/kimma]. You DO NOT need to run
these steps. They are here in case you want to run a similar analysis in
the future.
install.packages("devtools")
devtools::install_github("BIGslu/kimma")
library(kimma)
fit <- kmFit(dat, model = "~ condition + sex + (1|ptID)",
run.lme = TRUE, use.weights = TRUE)
lme/lmerel model: expression~condition+sex+(1|ptID)
Input: 20 libraries from 10 unique patients
Model: 20 libraries
Complete: 13419 genes
Failed: 0 genes
save(fit, file="data/kimma_result.RData")
Instead, load the results from the data downloaded for this workshop.
Note that this is an S3 object similar to the dat object.
In it, we have our log2 fold change estimate and
significance FDR for each gene and each variable.
load("data/kimma_result.RData")
fit$lme
## # A tibble: 40,257 × 7
## model gene variable estimate pval FDR gene_biotype
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr>
## 1 lme A1BG condition -0.313 0.264 0.347 protein_coding
## 2 lme A1BG sex 0.0796 0.814 0.954 protein_coding
## 3 lme A1BG (1 | ptID) 0 1 1 protein_coding
## 4 lme A2M condition -0.0940 0.605 0.682 protein_coding
## 5 lme A2M sex 1.40 0.0140 0.158 protein_coding
## 6 lme A2M (1 | ptID) 7.79 0.00525 0.0574 protein_coding
## 7 lme A2ML1 condition 0.791 0.0191 0.0356 protein_coding
## 8 lme A2ML1 sex -2.53 0.00000156 0.000388 protein_coding
## 9 lme A2ML1 (1 | ptID) 1.39 0.239 0.498 protein_coding
## 10 lme A4GALT condition 2.66 0.0000318 0.000108 protein_coding
## # … with 40,247 more rows
Let’s start with a basic dotplot like before. We first filter the fit
data to just the variable condition so that we have 1 point per gene.
Notice how we now pipe %>% the data into ggplot so that
we can modify it in dplyr/tidyr first.
fit$lme %>%
filter(variable == "condition") %>%
ggplot() +
aes(x = estimate, y = -log10(FDR)) +
geom_point()
Now let’s color the significant points. There is no “significant” variable in the data frame, so we first make one with mutate.
fit$lme %>%
filter(variable == "condition") %>%
mutate(significance = case_when(FDR < 0.01 ~ "FDR < 0.01", #<--- see change
TRUE ~ "NS")) %>%
ggplot() +
aes(x = estimate, y = -log10(FDR), color = significance) + #<--- see change
geom_point()
Or go further and color up vs down regulated genes differently.
fit$lme %>%
filter(variable == "condition") %>%
mutate(significance = case_when(FDR < 0.01 & estimate < 0 ~ "down", #<--- see change
FDR < 0.01 & estimate > 0 ~ "up",
TRUE ~ "NS")) %>%
ggplot() +
aes(x = estimate, y = -log10(FDR), color = significance) +
geom_point()
And we can specify the colors we want with another layer.
fit$lme %>%
filter(variable == "condition") %>%
mutate(significance = case_when(FDR < 0.01 & estimate < 0 ~ "down",
FDR < 0.01 & estimate > 0 ~ "up",
TRUE ~ "NS")) %>%
ggplot() +
aes(x = estimate, y = -log10(FDR), color = significance) +
geom_point() +
scale_color_manual(values = c("blue", "grey", "red")) #<--- see change
Since we’re defining significance at a cutoff, we can plot that
cutoff with a vertical line with geom_vline. Other similar
geoms are geom_hline for a horizontal line and
geom_abline for a line with slope a and intercept b.
fit$lme %>%
filter(variable == "condition") %>%
mutate(significance = case_when(FDR < 0.01 & estimate < 0 ~ "down",
FDR < 0.01 & estimate > 0 ~ "up",
TRUE ~ "NS")) %>%
ggplot() +
aes(x = estimate, y = -log10(FDR), color = significance) +
geom_point() +
scale_color_manual(values = c("blue", "grey", "red")) +
geom_hline(yintercept = -log10(0.01)) #<--- see change
You might also wish to center the 0 estimate or otherwise alter the plots limits.
fit$lme %>%
filter(variable == "condition") %>%
mutate(significance = case_when(FDR < 0.01 & estimate < 0 ~ "down",
FDR < 0.01 & estimate > 0 ~ "up",
TRUE ~ "NS")) %>%
ggplot() +
aes(x = estimate, y = -log10(FDR), color = significance) +
geom_point() +
scale_color_manual(values = c("blue", "grey", "red")) +
geom_hline(yintercept = -log10(0.01)) +
lims(x = c(-9, 9)) #<--- see change
If you accidentally (or purposefully) remove data point with your limits, R will warn you!
fit$lme %>%
filter(variable == "condition") %>%
mutate(significance = case_when(FDR < 0.01 & estimate < 0 ~ "down",
FDR < 0.01 & estimate > 0 ~ "up",
TRUE ~ "NS")) %>%
ggplot() +
aes(x = estimate, y = -log10(FDR), color = significance) +
geom_point() +
scale_color_manual(values = c("blue", "grey", "red")) +
geom_hline(yintercept = -log10(0.01)) +
lims(x = c(-6, 6)) #<--- see change
## Warning: Removed 14 rows containing missing values (geom_point).
We’re going to use a ggplot extension package to label our most significant genes. Please install and load it like so.
install.packages("ggrepel")
library(ggrepel)
First, we determine what our top 2 up and down regulated genes are.
top_deg <- fit$lme %>%
filter(variable == "condition") %>%
mutate(direction = case_when(estimate < 0 ~ "down",
estimate > 0 ~ "up")) %>%
group_by(direction) %>%
slice_min(FDR, n=2) %>%
pull(gene)
top_deg
## [1] "IFNAR2" "NAIP" "TERF2IP" "ZBTB21"
Then we add those labels to our fit data and use that variable in
geom_text_repel
fit$lme %>%
filter(variable == "condition") %>%
mutate(significance = case_when(FDR < 0.01 & estimate < 0 ~ "down",
FDR < 0.01 & estimate > 0 ~ "up",
TRUE ~ "NS")) %>%
mutate(label = case_when(gene %in% top_deg ~ gene)) %>%
ggplot() +
aes(x = estimate, y = -log10(FDR), color = significance) +
geom_point() +
scale_color_manual(values = c("blue", "grey", "red")) +
geom_hline(yintercept = -log10(0.01)) +
lims(x = c(-9, 9)) +
geom_text_repel(aes(label = label)) #<--- see change
## Warning: Removed 13415 rows containing missing values (geom_text_repel).
And a small edit if you don’t want the text to be colored. When you put an aesthetic in a specific geom, it only applies that change to that one geom.
fit$lme %>%
filter(variable == "condition") %>%
mutate(significance = case_when(FDR < 0.01 & estimate < 0 ~ "down",
FDR < 0.01 & estimate > 0 ~ "up",
TRUE ~ "NS")) %>%
mutate(label = case_when(gene %in% top_deg ~ gene)) %>%
ggplot() +
aes(x = estimate, y = -log10(FDR)) + #<--- see change
geom_point(aes(color = significance)) + # <--- see change
scale_color_manual(values = c("blue", "grey", "red")) +
geom_hline(yintercept = -log10(0.01)) +
lims(x = c(-9, 9)) +
geom_text_repel(aes(label = label))
## Warning: Removed 13415 rows containing missing values (geom_text_repel).
To round it out, let’s look at theme_bw, my favorite one
for volcano plots.
fit$lme %>%
filter(variable == "condition") %>%
mutate(significance = case_when(FDR < 0.01 & estimate < 0 ~ "down",
FDR < 0.01 & estimate > 0 ~ "up",
TRUE ~ "NS")) %>%
mutate(label = case_when(gene %in% top_deg ~ gene)) %>%
ggplot() +
aes(x = estimate, y = -log10(FDR)) +
geom_point(aes(color = significance)) +
scale_color_manual(values = c("blue", "grey", "red")) +
geom_hline(yintercept = -log10(0.01)) +
lims(x = c(-9, 9)) +
geom_text_repel(aes(label = label)) +
theme_bw() #<--- see change
## Warning: Removed 13415 rows containing missing values (geom_text_repel).
There you have it! Did you think you’d reach a 12 function plot so soon?
With big data, we often want to plot more than one thing at once. Facets allow you to make similar plots across multiple subsets of data with only 1 additional line of code. Here, we don’t filter for just the condition variable and instead, facet across 2 of the variables in the model.
fit$lme %>%
filter(variable %in% c("condition", "sex")) %>% #<--- see change
mutate(significance = case_when(FDR < 0.01 & estimate < 0 ~ "down",
FDR < 0.01 & estimate > 0 ~ "up",
TRUE ~ "NS")) %>%
mutate(label = case_when(gene %in% top_deg ~ gene)) %>%
ggplot() +
aes(x = estimate, y = -log10(FDR)) +
geom_point(aes(color = significance)) +
scale_color_manual(values = c("blue", "grey", "red")) +
geom_hline(yintercept = -log10(0.01)) +
lims(x = c(-9, 9)) +
geom_text_repel(aes(label = label)) +
theme_bw() +
facet_wrap(~ variable) #<--- see change
## Warning: Removed 3 rows containing missing values (geom_point).
## Warning: Removed 26830 rows containing missing values (geom_text_repel).
Since the y scales differ substantially for the two variable, we
allow them to differ with scales. This can be set to
“free_x”, “free_y”, or “free” for both x and y.
fit$lme %>%
filter(variable %in% c("condition", "sex")) %>% #<--- see change
mutate(significance = case_when(FDR < 0.01 & estimate < 0 ~ "down",
FDR < 0.01 & estimate > 0 ~ "up",
TRUE ~ "NS")) %>%
mutate(label = case_when(gene %in% top_deg ~ gene)) %>%
ggplot() +
aes(x = estimate, y = -log10(FDR)) +
geom_point(aes(color = significance)) +
scale_color_manual(values = c("blue", "grey", "red")) +
geom_hline(yintercept = -log10(0.01)) +
lims(x = c(-9, 9)) +
geom_text_repel(aes(label = label)) +
theme_bw() +
facet_wrap(~ variable, scales = "free_y") #<--- see change
## Warning: Removed 3 rows containing missing values (geom_point).
## Warning: Removed 26830 rows containing missing values (geom_text_repel).
full_data <- as.data.frame(dat$E) %>%
rownames_to_column("gene") %>%
pivot_longer(-gene, names_to = "libID", values_to = "log2CPM") %>%
inner_join(dat$targets)
## Joining, by = "libID"
total_seq.
Explore the help pages for scale_color_continuous,
scale_color_gradient, or scale_color_gradient2
and recolor your PCA to more easily see differences. As a reminder, here
is the base PCA plot.ggplot(dat.pca.meta) +
aes(x = PC1, y = PC2) +
geom_point() +
labs(x = "PC1 (24%)", y = "PC2 (11%)",
color = "Infection") +
theme_classic()
geom_text_repel, label the points in the PCA by
their libID